Subway station data description

The subway station data is used to examine whether there is a relationship between hostel locations and subway station locations.

# the_geom holds WKT text of the form "POINT (lon lat)"
NYsub_raw <- NYsub_raw %>%
  mutate(
    lat=str_extract(the_geom,"\\s[0-9.]+") %>% as.numeric(),  # value after the space
    lon=str_extract(the_geom,"-[0-9.]+") %>% as.numeric()     # negative value (west longitude)
    )
head(NYsub_raw)
##                                 URL OBJECTID             NAME
## 1 http://web.mta.info/nyct/service/        1         Astor Pl
## 2 http://web.mta.info/nyct/service/        2         Canal St
## 3 http://web.mta.info/nyct/service/        3          50th St
## 4 http://web.mta.info/nyct/service/        4        Bergen St
## 5 http://web.mta.info/nyct/service/        5 Pennsylvania Ave
## 6 http://web.mta.info/nyct/service/        6         238th St
##                                       the_geom          LINE
## 1 POINT (-73.99106999861966 40.73005400028978) 4-6-6 Express
## 2 POINT (-74.00019299927328 40.71880300107709) 4-6-6 Express
## 3 POINT (-73.98384899986625 40.76172799961419)           1-2
## 4 POINT (-73.97499915116808 40.68086213682956)         2-3-4
## 5 POINT (-73.89488591154061 40.66471445143568)           3-4
## 6 POINT (-73.90087000018522 40.88466700064975)             1
##                                                                    NOTES
## 1 4 nights, 6-all times, 6 Express-weekdays AM southbound, PM northbound
## 2 4 nights, 6-all times, 6 Express-weekdays AM southbound, PM northbound
## 3                                                  1-all times, 2-nights
## 4                               4-nights, 3-all other times, 2-all times
## 5                                            4-nights, 3-all other times
## 6                                      1-all times, exit only northbound
##        lat       lon
## 1 40.73005 -73.99107
## 2 40.71880 -74.00019
## 3 40.76173 -73.98385
## 4 40.68086 -73.97500
## 5 40.66471 -73.89489
## 6 40.88467 -73.90087
summary(NYsub_raw)
##      URL               OBJECTID         NAME             the_geom        
##  Length:473         Min.   :  1.0   Length:473         Length:473        
##  Class :character   1st Qu.:119.0   Class :character   Class :character  
##  Mode  :character   Median :237.0   Mode  :character   Mode  :character  
##                     Mean   :238.1                                        
##                     3rd Qu.:355.0                                        
##                     Max.   :643.0                                        
##      LINE              NOTES                lat             lon        
##  Length:473         Length:473         Min.   :40.58   Min.   :-74.03  
##  Class :character   Class :character   1st Qu.:40.68   1st Qu.:-73.98  
##  Mode  :character   Mode  :character   Median :40.72   Median :-73.95  
##                                        Mean   :40.73   Mean   :-73.94  
##                                        3rd Qu.:40.78   3rd Qu.:-73.90  
##                                        Max.   :40.90   Max.   :-73.76
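As a quick check of the coordinate extraction, the sketch below parses one the_geom value from the output above using base R only. It assumes every value follows the "POINT (lon lat)" pattern seen in the sample rows; parse_point is a hypothetical helper and not part of the original analysis.

```r
# Hypothetical helper: split WKT "POINT (lon lat)" strings into numeric lon/lat.
parse_point <- function(wkt) {
  # Drop the "POINT (" prefix and the trailing ")"
  inner <- sub("^POINT \\((.*)\\)$", "\\1", wkt)
  # Split on whitespace: the first value is longitude, the second is latitude
  parts <- strsplit(inner, "\\s+")
  lon <- as.numeric(vapply(parts, `[`, character(1), 1))
  lat <- as.numeric(vapply(parts, `[`, character(1), 2))
  data.frame(lon = lon, lat = lat)
}

# First station (Astor Pl) from the head() output above
pt <- parse_point("POINT (-73.99106999861966 40.73005400028978)")
print(pt)
```

The parsed values should agree with the lat and lon columns printed by head(NYsub_raw).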
# price (extreme value)
ggplot(airbnb,mapping=aes(price)) + 
  geom_density(kernel = "gaussian")+
  theme_classic() +
  theme(legend.position="top")+
  ggtitle("Price distribution (All data)")

# review
ggplot(airbnb,mapping=aes(number_of_reviews)) + 
  geom_histogram(binwidth = 5)+
  theme_classic() +
  theme(legend.position="top")+
  ggtitle("Review distribution (All data)")

As seen in the plots Price distribution (All data) and Review distribution (All data), both price and number of reviews span wide ranges and contain many extreme values.

# map
pal <- colorFactor(palette = "Dark2",domain=airbnb$neighbourhood_group)

ab_map <- leaflet() %>% 
  setView(lng = -73.9, lat = 40.73, zoom = 10) %>% 
  addProviderTiles(providers$Esri.OceanBasemap) %>% 
  addCircleMarkers(data=airbnb,
             lng=~longitude,
             lat=~latitude,
             popup = ~name,
             radius=2,
             color=~pal(neighbourhood_group),
             stroke=FALSE,
             fillOpacity = 0.5
             ) %>% 
  addCircleMarkers(data=NYsub_raw,
             lng=~lon,
             lat=~lat,
             popup = ~NAME,
             radius=1,
             color="black",
             stroke=TRUE,
             fillOpacity = 1
             )%>% 
  addLegend(data=airbnb,"bottomright", pal =pal, 
            values = ~neighbourhood_group,
            title = "NYC Airbnb Location<br>by Neighbourhood",
            opacity = 1
            ) %>% 
  addLegend(data=NYsub_raw,"topright", 
  colors =c("#000000"),
  labels= c("Subway station"),
  title= "NYC Subway Locations",
  opacity = 1)

ab_map

Map of hostels (coloured by neighbourhood group) and subway stations (black)

renamed_cor<-airbnb %>% rename("ID"=id, "Name"=name, "Host ID"=host_id, "Host Name"=host_name, "Neighbourhood Group"=neighbourhood_group,
                               "Neighbourhood"=neighbourhood,"Latitude"=latitude, "Longitude"=longitude, "Room Type"=room_type,
                               "Price"=price, "Minimum Nights"=minimum_nights, "No. of Reviews"=number_of_reviews, "Last Review"=last_review,
                               "Reviews Per Month"=reviews_per_month,"No. of Listings"=calculated_host_listings_count, "Availability"=availability_365)

airbnb_cor <- renamed_cor[, sapply(renamed_cor, is.numeric)]
airbnb_cor <- airbnb_cor[complete.cases(airbnb_cor), ]
correlation_matrix <- cor(airbnb_cor, method = "spearman")
corrplot(correlation_matrix, method = "square",order = "alphabet",tl.cex =0.7,tl.col = "black",tl.srt = 45,cl.cex=0.7)

Price and the location of the stay show the strongest negative correlation, marked in dark orange. Availability is positively correlated with both reviews per month and number of listings. One possible reason is that a host with many listings offers better stays and is more approachable to travelers.
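The correlation matrix above is computed with method = "spearman", which is the Pearson correlation of the ranks and is therefore less sensitive to the extreme prices noted earlier. A minimal sketch on made-up numbers (the vectors x and y are illustrative and not taken from the dataset):

```r
# Made-up data: y falls steadily as x rises, but not along a straight line
x <- c(1, 2, 3, 4, 5, 6)
y <- c(500, 200, 120, 90, 85, 80)

rho_spearman <- cor(x, y, method = "spearman")
rho_ranks    <- cor(rank(x), rank(y), method = "pearson")  # same quantity, computed by hand
rho_pearson  <- cor(x, y, method = "pearson")              # weakened by the non-linear shape

print(c(spearman = rho_spearman, ranks = rho_ranks, pearson = rho_pearson))
```

Because the toy relationship is strictly decreasing, the Spearman coefficient is -1 even though the Pearson coefficient is weaker.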

ggplot(airbnb,aes(x = neighbourhood_group)) + 
  geom_bar(aes(fill= neighbourhood_group))+ 
  scale_fill_manual(values=c("#FFFF00", "#66CC33", "#006666", "#003366", "#660033"))+
  geom_text(stat = 'count',aes(label = after_stat(count)), vjust=-0.3)+
  labs(title="Number of Listings vs Neighbourhood Group", x="Neighbourhood Group", y = "Number of listings")+
  theme_minimal()

Among the neighbourhood groups, Manhattan and Brooklyn have the most listings, with 21,661 and 20,104 respectively, while Staten Island has the fewest, with only 373.

review_pie<-airbnb%>%group_by(neighbourhood_group)%>% summarise(Total_review1 =sum(number_of_reviews, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
sumreview <- sum(review_pie$Total_review1)
review_pie$Total_review <- review_pie$Total_review1 * 100 / sumreview 
ggplot(review_pie, aes(x = "", y = Total_review, fill = neighbourhood_group)) +
  geom_bar(width = 1, stat = "identity")+ 
  coord_polar("y", start = 0)+ 
  scale_fill_brewer(palette = "Blues")+
  geom_text(aes(label = paste0(round(Total_review), "%")), position = position_stack(vjust = 0.5), size=3)+
  labs(title="Total Reviews vs Neighbourhood Group", x=NULL, y=NULL)

Summing the number of reviews by neighbourhood group shows that Brooklyn (43%), Manhattan (40%) and Queens (14%) received the most reviews, in that order. This suggests that Brooklyn is the most popular region for Airbnb stays.
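The percentage shares can be sketched in base R as group totals divided by the grand total. The data frame below is made up for illustration; only the column names mirror the dataset.

```r
# Made-up review counts for three hypothetical groups
reviews <- data.frame(
  neighbourhood_group = c("A", "A", "B", "C"),
  number_of_reviews   = c(10, 30, 40, 20)
)

# Total reviews per group, then each group's share of the grand total in percent
totals <- tapply(reviews$number_of_reviews, reviews$neighbourhood_group, sum)
share  <- round(100 * prop.table(totals))
print(share)
```

Rounded shares need not add up to exactly 100 in general, which is worth keeping in mind when reading the pie chart labels.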

No. of reviews vs neighbourhood & neighbourhood group

circular_bar<-airbnb%>%group_by(neighbourhood,neighbourhood_group)%>% summarise(Total_review =sum(number_of_reviews, na.rm = TRUE))
## `summarise()` regrouping output by 'neighbourhood' (override with `.groups` argument)
circular_bar<-cbind(circular_bar, id=c(1:221))
circular_bar<-arrange(circular_bar,desc(Total_review))
circular_bar<-circular_bar[1:30, ]
circular_bar<-cbind(circular_bar, id2=c(1:30))
circular_bar<-arrange(circular_bar,desc(id))
label_data <- circular_bar
number_of_bar <- nrow(label_data)
angle <-90 - 360 * (label_data$id2-0.5) /number_of_bar
label_data$hjust<-ifelse(angle < -90, 1, 0)
label_data$angle<-ifelse(angle < -90, angle+180, angle)
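# Bars are ordered by id2 (rank by total reviews) so that they line up with the labels placed at x=id2
ggplot(circular_bar, aes(x=as.factor(id2), y=log2(Total_review), fill=neighbourhood_group)) + 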
  geom_bar(stat="identity", alpha=0.5) +
  ylim(-5,30) +
  theme_minimal() +
  theme(
    axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid = element_blank(),
    plot.margin = unit(rep(-1,4), "cm")
  ) +
  coord_polar(start = 0)+
  geom_text(data=label_data, aes(x=id2, y=log2((Total_review))+3.5,
                                 label=neighbourhood, hjust=hjust), 
            color="black",fontface="bold",alpha=0.8, size=2.6, 
            angle= label_data$angle, inherit.aes = FALSE )

Brooklyn and Manhattan received more reviews than Queens and the other two groups. For instance, neighbourhoods such as Williamsburg in Brooklyn and Washington Heights in Manhattan stand out for their large numbers of reviews.

# Facets of price vs number of listings for the 5 neighbourhood groups
# (a fixed colour is set outside aes() so it is not treated as a data variable)
ggplot(data = airbnb) + 
  geom_point(mapping = aes(x = price, y = calculated_host_listings_count), colour = "#F38434") + 
  facet_wrap(~ neighbourhood_group, nrow = 2)+
  labs(title="Number of listings vs Price per Neighbourhood Group", x="Price", y = "Number of listings")

Facet wrap charts show the relation between price and number of listings for each neighbourhood group. Manhattan shows the widest spread of prices, reflecting pricey stays in the region. By contrast, offers in the Bronx and Staten Island are much cheaper, possibly because these areas are less attractive to visitors.

ggplot(airbnb, aes(x=neighbourhood_group, y=log10(price), fill=neighbourhood_group)) + 
  geom_jitter(aes(colour=neighbourhood_group), alpha=0.5) +
  geom_boxplot(alpha=0.3, outlier.colour = "black", outlier.shape = 1, notch = TRUE) +
  theme(legend.position="none")+
  labs(title="Price vs Neighbourhood Group", x="Neighbourhood Group", y = "Price (log10)")
## Warning: Removed 11 rows containing non-finite values (stat_boxplot).

The median and the 1st and 3rd quartiles of hostel price in Manhattan are clearly higher than in the other groups, which agrees with the previous chart. This may be attributed to the high cost of living in Manhattan, the core district of New York.

airbnb %>%
  ggplot(aes(x=neighbourhood_group ,fill =room_type))+
  labs(title = "Proportion of room type in different neighbourhood group",
       x = 'Neighbourhood group',
       y = 'Proportion')+
  geom_bar(position = 'fill')+
  theme_classic()

One likely reason for the higher hostel prices in Manhattan is its larger proportion of entire homes/apartments compared with the other neighbourhood groups.

# `colors` is assumed to be a vector of three fill colours, one per room type
airbnb %>% ggplot(aes(x=room_type))+
  labs(title = "Distribution of room type",
       x = "Room Type",
       y= "Count")+
  geom_bar(aes(fill=room_type))+
  scale_fill_manual(values=colors)

Hosts offer three room types. Entire home/apt accounts for about half of the listings, while Shared room has the smallest share.

Price of various room types

# Limit the y-axis to the boxplot whisker range so extreme prices do not flatten the violins
ylim1 <- boxplot.stats(airbnb$price)$stats[c(1, 5)]
airbnb%>%
  ggplot(aes(x=room_type, y=price))+
  labs(title = "Comparison of price among room types",
       x= "Room Type",
       y= "Price")+
  geom_violin(aes(fill=room_type))+
  scale_fill_manual(values=colors)+
  coord_cartesian(ylim = ylim1)+
  theme_fivethirtyeight()+
  theme(axis.title = element_text())

Prices for Entire home/apt are more widely spread and generally higher, while Shared room is the cheapest of the three.

airbnb%>%
  ggplot(aes(x=availability_365))+
  labs(title = "Availability of all hostels",
       y="Frequency",
       x="Number of days"
  )+
  stat_bin(geom = "path" , pad = FALSE)+
  theme_fivethirtyeight()+
  theme(axis.title = element_text())
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The graph shows that most hostels are available for fewer than 100 days a year. There is also a large spike at zero availability (availability_365 = 0), covering roughly 36% of listings, which is worth a deeper investigation.

Average price with zero availability: $136 (cheaper, mid-range) vs. average price of all hostels: $153

colors <- c("#F3BD5E", "#164597")
# Text options belong to the pie trace itself, not to `marker`
fig <- combinedAvailability %>% plot_ly(labels=~name, values = ~count)
fig <- fig %>% add_pie(hole = 0.45, showlegend=TRUE,
                       marker = list(colors = colors),
                       textposition = 'inside',
                       textinfo = 'label+percent',
                       insidetextfont = list(color = '#FFFFFF'))
fig <- fig %>% layout(title = "Distribution of hostel with zero Availability",  showlegend = T,
                      xaxis = list(showgrid = TRUE, zeroline = FALSE, showticklabels = FALSE),
                      yaxis = list(showgrid = TRUE, zeroline = FALSE, showticklabels = FALSE))
fig
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

Most hostels (64.1%) are available for at least one day, but those with zero availability (35.9%) still make up a significant share.

colors <- c('#0000cc','#00b35c', '#ffcf66')
fig <- plot_ly(roomtype, labels = ~room_type, values = ~count, type = 'pie',
               textposition = 'inside',
               textinfo = 'label+percent',
               insidetextfont = list(color = '#FFFFFF'),
               hoverinfo = 'text',
               text = ~paste(count, 'listings'),
               marker = list(colors = colors,
                             line = list(color = '#FFFFFF', width = 1.5)),
               #The 'pull' attribute can also be used to create space between the sectors
               showlegend = FALSE)
fig <- fig %>% layout(title = 'Room type distribution of zero availability hostels',
                      xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
                      yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
fig

Consistent with the overall picture, the room type shares are similar: about half are Entire home/apt (50.6%), followed by Private room (47.7%), with Shared room the smallest (1.69%).

p <- ggplot(zero,aes(x=room_type, y=log10(price))) +
  geom_jitter(aes(colour=neighbourhood_group), alpha=0.5) +
  geom_hline(yintercept=0) +
  labs(title = "Price distribution of ZERO availability by room type and neighbourhood group",
       x = "Room Type",
       y = "Price (log10)")
fig <- ggplotly(p,width = 800, height = 600)
fig

The distribution of price across room types under zero availability shows that staying in Manhattan is considerably more expensive than in other regions. For shared rooms, however, no clear price pattern is visible.

fig <- plot_ly(zero, x = ~price, y = ~number_of_reviews, z = ~reviews_per_month,
               color = ~room_type, colors= c('#f7a76e', '#6BA1C4','#ff0066'),
               width = 800, height = 600)
fig <- fig %>% add_markers()
fig <- fig %>% layout(scene = list(xaxis = list(title = 'price'),
                                   yaxis = list(title = 'number_of_reviews'),
                                   zaxis = list(title = 'reviews_per_month')),
                      annotations = list(
                        x = 1.13,
                        y = 1.05,
                        showarrow = FALSE
                      ))
fig
## Warning: Ignoring 4841 observations

The chart suggests that more expensive hostels tend to have few or no reviews, while competitively priced hostels are more popular and so collect more reviews. Price therefore appears to be an important factor in how many reviews a hostel receives. Ranking the hostels in descending order of number of reviews, the average price of the top 100 hostels is $89, while that of the bottom 100 is $116. Private rooms dominate the market, as they have more reviews than the other two room types. It is also possible that some rooms are booked through the hostel's own reservation system, so that those hostels are always full and need not be listed online for rental. This information helps hostel managers and owners understand the market and plan accordingly.
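The top-versus-bottom comparison can be sketched as follows. The data frame is made up, and n is set to 2 instead of the 100 used in the report; with real data, many listings tie at zero reviews, so the bottom group depends on how ties are ordered.

```r
# Made-up listings: price and review count
listings <- data.frame(
  price             = c(80, 90, 100, 150, 200, 250),
  number_of_reviews = c(120, 95, 60, 10, 2, 0)
)

n <- 2  # the report uses 100

# Rank listings from most to fewest reviews
ranked <- listings[order(listings$number_of_reviews, decreasing = TRUE), ]

mean_top    <- mean(head(ranked$price, n))  # average price of the most-reviewed listings
mean_bottom <- mean(tail(ranked$price, n))  # average price of the least-reviewed listings

print(c(top = mean_top, bottom = mean_bottom))
```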

Data cleaning

# NYC Airbnb
NYab <- airbnb %>% select(-c(host_id,host_name,
                               last_review,
                               reviews_per_month,
                               calculated_host_listings_count))

# keep only prices between Q1 and Q3 in NYab and remove Staten Island data
sum_price<-summary(NYab$price)

NYab <- NYab %>% 
  filter(price<=sum_price[5] & price >=sum_price[2] & neighbourhood_group!="Staten Island")

# Price distribution
ggplot(NYab,mapping=aes(price)) + 
  geom_histogram(binwidth = 5,color="black", fill="lightblue")+
  scale_color_grey() + 
  theme_classic() +
  theme(legend.position="top")+
  ggtitle("Price distribution (Only IQR)")

# NYC subway
NYsub <- NYsub_raw %>% select(-c(URL,
                               the_geom))


# Initialize extra columns (closest_station holds a station name, so it is character)
NYab <- NYab %>% mutate(station_dis=0,
                        closest_station=NA_character_,
                        station_ID=0
)

# Count the entries in LINE, a dash-separated list such as "2-3-4"
NYsub <- NYsub %>% mutate(No_lines=str_count(LINE,"[^-]+"))

summary(NYsub)
##     OBJECTID         NAME               LINE              NOTES          
##  Min.   :  1.0   Length:473         Length:473         Length:473        
##  1st Qu.:119.0   Class :character   Class :character   Class :character  
##  Median :237.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :238.1                                                           
##  3rd Qu.:355.0                                                           
##  Max.   :643.0                                                           
##       lat             lon            No_lines    
##  Min.   :40.58   Min.   :-74.03   Min.   :1.000  
##  1st Qu.:40.68   1st Qu.:-73.98   1st Qu.:1.000  
##  Median :40.72   Median :-73.95   Median :2.000  
##  Mean   :40.73   Mean   :-73.94   Mean   :1.877  
##  3rd Qu.:40.78   3rd Qu.:-73.90   3rd Qu.:2.000  
##  Max.   :40.90   Max.   :-73.76   Max.   :5.000

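To keep extreme values from distorting the analysis, only listings priced between the first and third quartiles are kept. Staten Island listings and columns that are not needed are removed as well.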
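The price filter can be sketched on a toy vector. This version uses quantile() directly rather than indexing into summary(); the prices are made up for illustration.

```r
# Made-up prices, including one extreme value
prices <- c(10, 50, 69, 80, 106, 150, 175, 300, 10000)

# First and third quartiles (R's default quantile type 7)
q <- quantile(prices, probs = c(0.25, 0.75), names = FALSE)

# Keep only values inside the interquartile range, bounds included
kept <- prices[prices >= q[1] & prices <= q[2]]
print(kept)
```

Note that this keeps only the middle half of the data, which is a stricter rule than the usual 1.5 x IQR outlier fence.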

Calculate the distance from each hostel to its closest subway station

# Calculate closest subway station
# (`distance()` and `ang2rad()` are helper functions not shown in this section)
for (i in seq_len(nrow(NYab))){
  dis_temp <- numeric(nrow(NYsub))
  for (j in seq_len(nrow(NYsub))){
    dis_temp[j] <- distance(ang2rad(NYab$latitude[i]),ang2rad(NYab$longitude[i]),ang2rad(NYsub$lat[j]),ang2rad(NYsub$lon[j]))
  }
  nearest <- which.min(dis_temp)  # index of the closest station
  NYab$station_dis[i] <- dis_temp[nearest]
  NYab$closest_station[i] <- NYsub$NAME[nearest]
  NYab$station_ID[i] <- NYsub$OBJECTID[nearest]
}

(sum_dis <- summary(NYab$station_dis))
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.002372 0.204155 0.314456 0.411766 0.460448 8.063417

The distance from each hostel to its closest subway station is calculated with the distance() function. As the summary shows, there are outliers here as well: the greatest distance from a hostel to its closest station is about 8.06 km.
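The helpers ang2rad() and distance() are not shown in this section. The sketch below is one plausible way to write them, assuming a haversine great-circle distance in kilometres on a sphere of radius 6371 km; the report's own definitions may differ. Because the function is vectorised over the second point, the inner loop over stations can be replaced by a single call. The listing coordinates are made up, and the two stations come from the sample rows shown earlier.

```r
# Assumed stand-ins for the report's helpers (not the original definitions)
ang2rad <- function(deg) deg * pi / 180

# Haversine distance in km; all inputs in radians, vectorised over lat2/lon2
distance <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  dlat <- lat2 - lat1
  dlon <- lon2 - lon1
  a <- sin(dlat / 2)^2 + cos(lat1) * cos(lat2) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Two stations taken from the head(NYsub_raw) output
stations <- data.frame(
  NAME = c("Astor Pl", "50th St"),
  lat  = c(40.73005400028978, 40.76172799961419),
  lon  = c(-73.99106999861966, -73.98384899986625)
)

# A made-up listing location close to Astor Pl
listing_lat <- 40.7290
listing_lon <- -73.9900

# Distances from the listing to every station in one vectorised call
d <- distance(ang2rad(listing_lat), ang2rad(listing_lon),
              ang2rad(stations$lat), ang2rad(stations$lon))
nearest <- stations$NAME[which.min(d)]

# Sanity check: one degree of longitude along the equator
one_degree_km <- distance(0, 0, 0, ang2rad(1))
print(nearest)
```

Under these assumptions one degree of longitude at the equator comes out at roughly 111.19 km, which is a convenient check that the units are kilometres.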

Distance data summary

# Remove data (distance > Q3)
NYab_dis <- NYab %>% filter(station_dis<=sum_dis[5]) 

# Visualization
ggplot(NYab_dis,mapping=aes(station_dis)) +
  geom_histogram(binwidth = 0.02)+
  scale_color_grey() +
  theme_classic() +
  theme(legend.position="top")+
  ggtitle("Distance from hostels to \nits closest subway station ")

# ggplot(NYab_dis,mapping=aes(x=station_ID))+
#   geom_bar()

Since the distances also contain extreme values, hostels farther from their closest station than the third quartile (about 0.46 km) are removed so that outliers do not distort the analysis.

Compare price and distance from subway

# Scatter plot by room type (price vs distance from subway)
price_dis_rmtp <- NYab_dis %>% ggplot(mapping=aes(y=station_dis,x=price,color=room_type))+
  geom_point()+
  geom_smooth(se=F)+
  ggtitle("Scatter Plot of price (by room type) \nand distance from closest subway station") +
  xlab("Price") + ylab("Distance(Km)")
price_dis_rmtp
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

As shown in the scatter plot, there is no obvious relationship between price and the distance to the closest station.

# Calculate number of hostels near each subway station
# (grouped by station ID, since different stations can share the same name)
nb_bnb <- NYab_dis %>% group_by(station_ID) %>% summarise(No_bnb=n())
## `summarise()` ungrouping output (override with `.groups` argument)
NYsub <- NYsub %>% left_join(nb_bnb,by=c("OBJECTID"="station_ID"))

# Sort subway station by no of nearby bnb
NYsub_bnb <- NYsub %>% arrange(desc(No_bnb))

# Bubble on map station by no of bnb
mybins <- seq(0, 500, by=100)
palette_rev <- rev(brewer.pal(5, "Spectral"))
mypalette <- colorBin( palette=palette_rev,
                       domain=NYsub$No_bnb, na.color="transparent", bins=mybins)

sub_bb_map <- NYsub %>% leaflet() %>% 
  setView(lng = -73.9, lat = 40.73, zoom = 11) %>% 
  addProviderTiles(providers$Esri.OceanBasemap) %>% 
  addCircleMarkers(~lon, ~lat, 
    fillColor = ~mypalette(No_bnb), fillOpacity = 0.7, color="white", radius=3, stroke=FALSE
  ) %>%
  addLegend( pal=mypalette, values=~No_bnb, opacity=0.9, title = "No. of nearby hostels<br>by stations", position = "bottomright" )
sub_bb_map

As the map above shows, many hostels are located close to subway stations in Manhattan. This suggests that the most crowded area is around Central Park.
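One caveat when counting hostels per station: station names are not guaranteed to be unique, whereas OBJECTID values identify individual stations. The sketch below uses made-up assignments to show how counting by name can merge two distinct stations into one. The IDs and names in the toy data are illustrative assumptions.

```r
# Made-up nearest-station assignments: IDs 2 and 77 are different stations with the same name
assigned <- data.frame(
  station_ID      = c(2, 2, 77, 77, 77, 5),
  closest_station = c("Canal St", "Canal St", "Canal St",
                      "Canal St", "Canal St", "Pennsylvania Ave"),
  stringsAsFactors = FALSE
)

by_name <- table(assigned$closest_station)  # merges both "Canal St" stations
by_id   <- table(assigned$station_ID)       # keeps them apart

print(by_name)
print(by_id)
```

Counting by ID gives 2 and 3 hostels for the two stations, while counting by name reports 5 for a single "Canal St".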

##                      word  freq
## room                 room 10041
## bedroom           bedroom  8214
## private           private  7313
## apartment       apartment  6585
## cozy                 cozy  5051
## apt                   apt  4068
## studio             studio  4059
## brooklyn         brooklyn  4022
## the                   the  3882
## spacious         spacious  3769
## manhattan       manhattan  3385
## with                 with  3099
## park                 park  3086
## east                 east  3074
## sunny               sunny  2921
## and                   and  2870
## williamsburg williamsburg  2677
## beautiful       beautiful  2503
## near                 near  2346
## village           village  2297

Surprisingly, the words “private”, “cozy”, and “sunny” are associated with a lower median price, whereas “spacious” and “beautiful” show no obvious association with the median price.
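The code behind the word table and the median comparison is not shown here. A minimal base R sketch of both steps, on made-up listing names and prices, could look like the following; median_by_word is a hypothetical helper and the tokenisation rule (lower-case, split on non-letters) is an assumption.

```r
# Made-up listing names and prices, for illustration only
listings <- data.frame(
  name  = c("Cozy private room", "Sunny cozy studio", "Spacious loft",
            "Private room near park", "Beautiful spacious apartment"),
  price = c(60, 90, 200, 70, 180),
  stringsAsFactors = FALSE
)

# Word frequency: lower-case the names, split on non-letters, count each word
words <- unlist(strsplit(tolower(listings$name), "[^a-z]+"))
words <- words[nzchar(words)]
freq  <- sort(table(words), decreasing = TRUE)

# Hypothetical helper: median price of listings with and without a given word
median_by_word <- function(word) {
  has_word <- grepl(paste0("\\b", word, "\\b"), tolower(listings$name))
  c(with    = median(listings$price[has_word]),
    without = median(listings$price[!has_word]))
}

cozy <- median_by_word("cozy")
print(head(freq))
print(cozy)
```

A difference in medians like this is only an association; room type and location are not controlled for.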

This comparison is inspired by the following study, which analyzes the relationship between how food is described on restaurant menus and other variables such as price: “Word Salad: Relating Food Prices and Descriptions” https://homes.cs.washington.edu/~nasmith/papers/chahuneau+gimpel+routledge+scherlis+smith.emnlp12.pdf

Conclusion

To sum up, price and location influence the market to a certain extent, and other factors may also play a part. Although the distance from each hostel to its closest subway station was added as an extra criterion for analyzing the market, the results showed no strong relation between price and that distance. More importantly, it is worth noting that hostels with zero availability generally have lower prices and fewer reviews. A possible reason is that most of those rooms are private rooms, which are relatively cheap, so that listing them on Airbnb, which charges a commission as the broker, is not profitable. The popularity of those hostels may therefore be underestimated.
Furthermore, it is noteworthy that certain words in hostel names appear more appealing to customers at the booking stage. Choosing an attractive name for the hostel may be an important concern for owners.

Key takeaway

The room type offered is believed to be the main driver of marketing direction and of a more lucrative business.

Limitation

The analysis is limited by the lack of a direct measure of occupancy rate, so the number of reviews is used as a proxy for it, which may introduce bias. In addition, the analysis relies on a single data source and does not take the time dimension into account, which restricts its scope.